Sean Davis, MD, PhD
Center for Cancer Research National Cancer Institute @seandavis12
https://seandavi.github.io
Background and introduction
Two example workflows
Additional functionality
These slides were written in R and all plots and results are generated from code in the slides–reproducible research in action!
Analysis and comprehension of high-throughput biological data
Bioconductor is well-used and well-respected
Bioconductor is a Toolkit for high-throughput biological data science built with reproducible computational research as a core value.
Bioconductor software supports all aspects the biological data science workflow.
Use the GEOquery package to fetch data about GSE103512.
library(GEOquery)
gse = getGEO("GSE103512", destdir='/tmp')[[1]]
gse## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 54715 features, 280 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM2772660 GSM2772661 ... GSM2772939 (280 total)
## varLabels: title geo_accession ... weight:ch1 (73 total)
## varMetadata: labelDescription
## featureData
## featureNames: 1007_PM_s_at 1053_PM_at ... AFFX-TrpnX-M_at (54715
## total)
## fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16
## total)
## fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: GPL13158
Examine two variables of interest, cancer type and tumor/normal status.
library(xtable)
print(xtable(with(pData(gse),table(`cancer type:ch1`,`normal:ch1`))),type='html')| no | yes | |
|---|---|---|
| BC | 65 | 10 |
| CRC | 57 | 12 |
| NSCLC | 60 | 9 |
| PCA | 60 | 7 |
Information about features measured are also included.
Filter gene expression by variance to find most informative genes.
sds = apply(exprs(gse),1,sd)
dat = exprs(gse)[order(sds,decreasing = TRUE)[1:500],]Perform multidimensional scaling and prepare for plotting.
mdsvals = cmdscale(dist(t(dat)))
mdsvals = as.data.frame(mdsvals)
mdsvals$Type=factor(pData(gse)[,'cancer type:ch1'])
mdsvals$Normal = factor(pData(gse)[,'normal:ch1'])And do the plot.
library(ggplot2)
ggplot(mdsvals, aes(x=V1,y=V2,shape=Normal,color=Type)) +
geom_point(size=4, alpha=0.6) + theme(text=element_text(size = 18))Use the GenomicDataCommons package to find and download variants from the TCGA cutaneous melanoma dataset.
library(GenomicDataCommons)
fnames = files() %>%
GenomicDataCommons::filter(~ cases.project.project_id=='TCGA-SKCM' &
data_type=='Masked Somatic Mutation' &
data_format=='MAF' &
analysis.workflow_type=='MuTect2 Variant Aggregation and Masking') %>%
ids() %>%
gdcdata()And now take those data directly to maftools for analysis and visualization.
library(maftools)
melanoma = read.maf(maf = fnames[1])v = oncodrive(maf = melanoma, AACol = 'HGVSp_Short', minMut = 5, pvalMethod = 'zscore')
plotOncodrive(v,labelSize=40,fdrCutOff = 0.1)Hubs and data packages give users access to curated and versioned public data.
Contact me: